For the following exercises we will again use data from Gapminder, this time on life expectancy.
As usual, we first need to read in the data. You can copy and paste the following code into your script and run it.
library(readr)
gap_life <- read_csv("../data/gapminder/life_expectancy_years.csv")
## Parsed with column specification:
## cols(
## .default = col_double(),
## country = col_character()
## )
## See spec(...) for full column specifications.
Again, the data are currently in wide format.
Select only the columns for the years 1900 to 1999 using the selection helper starts_with(). We also want to keep the country column.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
gap_life %>%
  select(country, starts_with("19"))
## # A tibble: 187 x 101
## country `1900` `1901` `1902` `1903` `1904` `1905` `1906` `1907` `1908`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghan~ 29.2 29.3 29.3 29.4 29.4 29.5 29.6 29.6 29.7
## 2 Albania 35.5 35.5 35.5 35.5 35.5 35.5 35.5 35.5 35.5
## 3 Algeria 30.1 30.2 30.3 31.3 25.3 28 29.5 29.4 29.3
## 4 Andorra NA NA NA NA NA NA NA NA NA
## 5 Angola 29.5 29.6 29.7 29.8 29.9 30 30.1 30.1 30.2
## 6 Antigu~ 33.7 33.7 33.7 33.7 33.7 33.7 33.7 33.7 33.7
## 7 Argent~ 36.6 37.2 37.8 38.3 38.9 39.5 40.2 41 41.7
## 8 Armenia 35.2 35.4 35.6 35.8 36.1 36.3 36.5 36.7 36.9
## 9 Austra~ 50 50.5 51.1 51.6 52.1 52.7 53.2 53.7 54.3
## 10 Austria 41.5 42 41 40.1 40.7 41.3 42 42.6 43.2
## # ... with 177 more rows, and 91 more variables: `1909` <dbl>,
## # `1910` <dbl>, `1911` <dbl>, `1912` <dbl>, `1913` <dbl>, `1914` <dbl>,
## # `1915` <dbl>, `1916` <dbl>, `1917` <dbl>, `1918` <dbl>, `1919` <dbl>,
## # `1920` <dbl>, `1921` <dbl>, `1922` <dbl>, `1923` <dbl>, `1924` <dbl>,
## # `1925` <dbl>, `1926` <dbl>, `1927` <dbl>, `1928` <dbl>, `1929` <dbl>,
## # `1930` <dbl>, `1931` <dbl>, `1932` <dbl>, `1933` <dbl>, `1934` <dbl>,
## # `1935` <dbl>, `1936` <dbl>, `1937` <dbl>, `1938` <dbl>, `1939` <dbl>,
## # `1940` <dbl>, `1941` <dbl>, `1942` <dbl>, `1943` <dbl>, `1944` <dbl>,
## # `1945` <dbl>, `1946` <dbl>, `1947` <dbl>, `1948` <dbl>, `1949` <dbl>,
## # `1950` <dbl>, `1951` <dbl>, `1952` <dbl>, `1953` <dbl>, `1954` <dbl>,
## # `1955` <dbl>, `1956` <dbl>, `1957` <dbl>, `1958` <dbl>, `1959` <dbl>,
## # `1960` <dbl>, `1961` <dbl>, `1962` <dbl>, `1963` <dbl>, `1964` <dbl>,
## # `1965` <dbl>, `1966` <dbl>, `1967` <dbl>, `1968` <dbl>, `1969` <dbl>,
## # `1970` <dbl>, `1971` <dbl>, `1972` <dbl>, `1973` <dbl>, `1974` <dbl>,
## # `1975` <dbl>, `1976` <dbl>, `1977` <dbl>, `1978` <dbl>, `1979` <dbl>,
## # `1980` <dbl>, `1981` <dbl>, `1982` <dbl>, `1983` <dbl>, `1984` <dbl>,
## # `1985` <dbl>, `1986` <dbl>, `1987` <dbl>, `1988` <dbl>, `1989` <dbl>,
## # `1990` <dbl>, `1991` <dbl>, `1992` <dbl>, `1993` <dbl>, `1994` <dbl>,
## # `1995` <dbl>, `1996` <dbl>, `1997` <dbl>, `1998` <dbl>, `1999` <dbl>
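To make the matching rule of starts_with() easier to see, here is a minimal sketch on a tiny made-up tibble. The object names `toy_wide` and `toy_1900s` and all values are invented for illustration and are not part of the Gapminder data. Only column names that begin with "19" are kept, alongside the explicitly named country column.

```r
library(tibble)
library(dplyr)

# Toy wide data (invented values) with years from three centuries
toy_wide <- tibble(
  country = "A",
  `1899`  = 1,
  `1900`  = 2,
  `1999`  = 3,
  `2000`  = 4
)

# Keep country plus every column whose name starts with "19"
toy_1900s <- toy_wide %>%
  select(country, starts_with("19"))

selected <- names(toy_1900s)
```

The columns `1899` and `2000` are dropped because their names start with "18" and "20".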
As you may have already noticed, the dataset has some missing data points. Before we start analyzing the data, we might want to know for how many countries we have complete data.
Count the countries with complete data using the drop_na() function from tidyr.
library(tidyr)
gap_life %>%
  drop_na() %>%
  nrow()
## [1] 184
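As a small-scale illustration of this counting approach, the following sketch uses a made-up tibble (the object `toy` and its values are invented, not taken from the Gapminder file). drop_na() without further arguments removes every row that contains at least one NA, so nrow() afterwards counts the complete rows. The base R function complete.cases() should give the same count.

```r
library(tibble)
library(tidyr)

# Toy data (invented values): country B has one missing year
toy <- tibble(
  country = c("A", "B", "C"),
  `1900`  = c(30.1, NA, 41.5),
  `1901`  = c(30.2, 35.0, 42.0)
)

# drop_na() removes every row that contains at least one NA
n_complete <- nrow(drop_na(toy))

# Base R cross-check: complete.cases() flags rows without any NA
n_complete_base <- sum(complete.cases(toy))
```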
As in the previous set of data wrangling exercises, we now want to transform the data into long format.
library(tidyr)
gap_life <- gap_life %>%
  gather(-country, key = "year", value = "lifeExp")
Now let’s apply some of the advanced filtering options we discussed in the Data Wrangling - Part 2 session.
gap_life <- gap_life %>%
  mutate(country = as.factor(country),
         year = as.integer(year))
## Warning: NAs introduced by coercion
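The coercion warning above means that at least some values of year could not be converted to integers. One possible cause, and this is an assumption rather than something the source states, is that the gather() step was run a second time on data that were already in long format, which would put the column names "year" and "lifeExp" into the year column. The following minimal sketch on invented toy data (the names `toy_wide` and `toy_long` are hypothetical) reshapes once into a new object and then checks the conversion.

```r
library(tibble)
library(dplyr)
library(tidyr)

# Toy wide data (invented values)
toy_wide <- tibble(
  country = c("A", "B"),
  `1990`  = c(70.1, 65.3),
  `1991`  = c(70.4, 65.9)
)

# Reshape once, writing to a new object so that re-running the code
# never gathers a data frame that is already in long format
toy_long <- toy_wide %>%
  gather(-country, key = "year", value = "lifeExp") %>%
  mutate(country = as.factor(country),
         year = as.integer(year))

# Zero NAs in year means every former column name was a valid year
n_bad_years <- sum(is.na(toy_long$year))
```

If `n_bad_years` is larger than zero in your own data, it may help to re-read the CSV file and run the reshaping step exactly once.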
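Create two new dataframes: one with the data for the 1990s and one with the data for Germany. Use a helper function from dplyr to create the first new dataframe and a specific matching operator to create the second one.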
gap_life_1990s <- gap_life %>%
  filter(between(year, 1990, 1999))
gap_life_1990s
## # A tibble: 0 x 3
## # ... with 3 variables: country <fct>, year <int>, lifeExp <chr>
gap_life_ger <- gap_life %>%
  filter(country %in%
           c("Germany", "West Germany", "East Germany"))
gap_life_ger
## # A tibble: 438 x 3
## country year lifeExp
## <fct> <int> <chr>
## 1 Germany NA 1800
## 2 Germany NA 1801
## 3 Germany NA 1802
## 4 Germany NA 1803
## 5 Germany NA 1804
## 6 Germany NA 1805
## 7 Germany NA 1806
## 8 Germany NA 1807
## 9 Germany NA 1808
## 10 Germany NA 1809
## # ... with 428 more rows
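To show how the two filters behave when year holds valid integers, here is a minimal sketch on a made-up long-format tibble. The object names `toy_long`, `toy_1990s` and `toy_ger` and all values are invented for illustration. between() includes both boundaries, and %in% keeps a row if its value matches any element of the vector on the right.

```r
library(tibble)
library(dplyr)

# Toy long-format data (invented values)
toy_long <- tibble(
  country = c("Germany", "Germany", "France", "France"),
  year    = c(1989L, 1990L, 1999L, 2000L),
  lifeExp = c(75.0, 75.3, 78.8, 79.1)
)

# between(x, left, right) is shorthand for x >= left & x <= right
toy_1990s <- toy_long %>%
  filter(between(year, 1990, 1999))

# %in% keeps rows whose country matches any element of the vector;
# names that do not occur in the data simply match nothing
toy_ger <- toy_long %>%
  filter(country %in% c("Germany", "West Germany", "East Germany"))
```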
For some comparisons (especially via plots), it might help to know which continent the country is located on. For this purpose, we will create a new continent variable. As it would be quite tedious to create this variable manually for all of the countries in the dataset, we will do this only for a subset in this exercise. Just run the following code in your local script to create this subset.
gap_life_subset <- gap_life %>%
  filter(country %in%
           c("Netherlands", "Brazil", "China", "Algeria", "New Zealand"))
Use case_when() to create this new variable.
gap_life_subset %>%
  mutate(continent = factor(case_when(
    country == "Algeria" ~ "Africa",
    country == "Brazil" ~ "Americas",
    country == "China" ~ "Asia",
    country == "Netherlands" ~ "Europe",
    country == "New Zealand" ~ "Oceania"
  )))
## # A tibble: 2,190 x 4
## country year lifeExp continent
## <fct> <int> <chr> <fct>
## 1 Algeria NA 1800 Africa
## 2 Brazil NA 1800 Americas
## 3 China NA 1800 Asia
## 4 Netherlands NA 1800 Europe
## 5 New Zealand NA 1800 Oceania
## 6 Algeria NA 1801 Africa
## 7 Brazil NA 1801 Americas
## 8 China NA 1801 Asia
## 9 Netherlands NA 1801 Europe
## 10 New Zealand NA 1801 Oceania
## # ... with 2,180 more rows
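One detail of case_when() that matters once the subset grows: values that match none of the conditions become NA. The following minimal sketch on a made-up tibble (the object `toy`, the country "Atlantis" and the label "Unknown" are invented for illustration) adds a final `TRUE ~ ...` condition as a fallback.

```r
library(tibble)
library(dplyr)

# Toy data: "Atlantis" is deliberately not covered by any condition
toy <- tibble(country = c("Algeria", "Brazil", "Atlantis"))

toy <- toy %>%
  mutate(continent = case_when(
    country == "Algeria" ~ "Africa",
    country == "Brazil"  ~ "Americas",
    TRUE                 ~ "Unknown"  # fallback for unmatched countries
  ))
```

Conditions are evaluated in order, so the fallback has to come last. Note also that the solution above only prints its result. To keep the new variable, the result would need to be assigned to an object.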